Regression with a Tabular Gemstone Price Dataset

About Dataset

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Gemstone Price Prediction dataset. The goal is to predict the price of cubic zirconia stones.

There are 9 independent variables (including id):

Target Variable:

Metrics:

What was done in this notebook?

Outline

1. Import Necessary Libraries

2. EDA

2.1. Dataset Overview

Read the DataFrame and explore its shape, missing values, and column info:

depth has missing values
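The overview step can be sketched with pandas; `train.csv` is the assumed competition file name, so a small inline sample stands in for it here:

```python
import io
import pandas as pd

# Inline sample standing in for the competition's train.csv;
# the real file has more columns (cut, color, clarity, table, x, y, z).
csv = io.StringIO(
    "id,carat,depth,price\n"
    "0,0.5,61.5,1500\n"
    "1,1.2,,6000\n"
    "2,0.7,62.0,2100\n"
)
df = pd.read_csv(csv)

print(df.shape)            # (rows, columns)
df.info()                  # dtypes and non-null counts
missing = df.isna().sum()  # per-column missing-value counts
print(missing)             # only depth has a missing value in this sample
```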

2.2. Univariate Analysis

There are 9 feature columns: 6 numeric and 3 categorical.

This section has two parts: analysis of the numeric columns and analysis of the categorical columns.

2.2.1. Explore Categorical Columns

Categorical Columns in the dataset:

Explore the categorical columns by plotting barplots.
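A minimal sketch of the barplot step, using a toy DataFrame with the dataset's three categorical columns (cut, color, clarity) and matplotlib's headless Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the training DataFrame
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good", "Ideal"],
    "color": ["D", "E", "D", "F", "E"],
    "clarity": ["SI1", "VS2", "SI1", "SI1", "VS2"],
})

cat_cols = df.select_dtypes(include="object").columns
fig, axes = plt.subplots(1, len(cat_cols), figsize=(12, 3))
for ax, col in zip(axes, cat_cols):
    counts = df[col].value_counts()
    counts.plot.bar(ax=ax, title=col)  # barplot of category frequencies
fig.tight_layout()
fig.savefig("categorical_barplots.png")
```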

From above:

  1. Most of the cubic zirconia stones in the dataset are colorless.
  2. The dataset predominantly consists of cubic zirconia stones with ideal and premium cuts.
  3. The dataset primarily consists of cubic zirconia stones with clarity grades of SI1 and VS2, indicating that the majority of the stones are of high quality.

2.2.2. Explore Numeric Columns

Numeric Columns in the DataFrame:

Explore the numeric columns in the dataset by plotting distplots.
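A sketch of the distribution plots using plain matplotlib histograms in place of seaborn's deprecated `distplot`; the synthetic `carat` and `table` columns below only mimic the shapes described next, they are not the real data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic columns mimicking the dataset's shapes:
# carat is right-skewed (mostly < 1), table clusters around 56-57%
df = pd.DataFrame({
    "carat": rng.gamma(2.0, 0.3, 500),
    "table": rng.normal(57, 2, 500),
})

num_cols = df.select_dtypes(include="number").columns
fig, axes = plt.subplots(1, len(num_cols), figsize=(10, 3))
for ax, col in zip(axes, num_cols):
    ax.hist(df[col], bins=40, density=True)  # histogram in place of distplot
    ax.set_title(col)
fig.tight_layout()
fig.savefig("numeric_distplots.png")
```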

From above:

  1. Most of the cubic zirconia stones in the dataset have a weight of less than 1 carat.

  2. The dataset primarily consists of cubic zirconia stones with table percentages predominantly ranging from 55% to 60%, with 56% and 57% being the most frequent values.

  3. Since the distributions of the competition training data and the original Gemstone dataset are quite similar, it is reasonable to concatenate the two before training.
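The concatenation step is a one-liner with pandas; the two frames below are hypothetical stand-ins for the competition train set and the original dataset:

```python
import pandas as pd

# Hypothetical stand-ins; in the notebook these come from CSV files
train = pd.DataFrame({"carat": [0.5, 1.2], "price": [1500, 6000]})
original = pd.DataFrame({"carat": [0.7, 0.9], "price": [2100, 3300]})

# Stack the two sources, dropping any exact duplicate rows
combined = pd.concat([train, original], ignore_index=True)
combined = combined.drop_duplicates().reset_index(drop=True)
print(combined.shape)  # (4, 2)
```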

2.3. Bivariate Analysis

Analyze the relationship between price and cut, color, and clarity by plotting boxplots.

We can try using the median of the "Price" grouped by each categorical feature as the basis for encoding categorical features.

3. Feature Engineering

3.1. Numerical Columns Processing (Generate New Feature)

3.2. Categorical Feature Encoding

Step 1: Identify the numeric features that are highly correlated with the label.

Step 2: Group by each categorical feature, calculate the median of those features, and standardize the results.

Step 3: Sum the standardized results for each categorical feature and sort them.

Step 4: The sorted order becomes the encoding for that categorical feature.
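The four steps above can be sketched on a toy frame; `carat` and `price` stand in for the columns that Step 1 would identify as highly correlated with the label (in the notebook this would come from something like `df.corr()`):

```python
import pandas as pd

# Toy data; cut is the categorical feature to encode
df = pd.DataFrame({
    "cut":   ["Ideal", "Premium", "Good", "Ideal", "Good", "Premium"],
    "carat": [0.5, 1.0, 0.9, 0.6, 1.1, 1.2],
    "price": [1500, 5000, 3800, 1700, 4200, 6000],
})
corr_cols = ["carat", "price"]  # Step 1: label-correlated features (assumed)

# Step 2: per-category medians of those columns, then standardize
med = df.groupby("cut")[corr_cols].median()
std = (med - med.mean()) / med.std()

# Step 3: sum the standardized medians and sort
order = std.sum(axis=1).sort_values()

# Step 4: the sorted rank becomes the ordinal code for each category
encoding = {cat: rank for rank, cat in enumerate(order.index)}
df["cut_encoded"] = df["cut"].map(encoding)
print(encoding)  # {'Ideal': 0, 'Good': 1, 'Premium': 2}
```

This keeps the encoding monotone in the feature's relationship with price, instead of an arbitrary label order.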

4. Modeling

4.1. Base Model

LGBMRegressor, XGBRegressor, and CatBoostRegressor show the best performance.

4.2. Model Fine-Tuning

4.2.1. LGBMRegressor

4.2.2. XGBRegressor

4.2.3. CatBoostRegressor

4.2.4. Collect Best Parameters

5. Ensemble Stacking